"Data-snooping, technical trading rule performance and the bootstrap"

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
dat = pd.read_csv('Amazon.csv')
dat = dat[['Date','Close']]
dat.columns = ['ds','y']
dat.tail()
| | ds | y |
|---|---|---|
| 6150 | 2021-10-21 | 3435.010010 |
| 6151 | 2021-10-22 | 3335.550049 |
| 6152 | 2021-10-25 | 3320.370117 |
| 6153 | 2021-10-26 | 3376.070068 |
| 6154 | 2021-10-27 | 3396.189941 |
plt.figure(figsize=(12, 6))
plt.plot(dat.index, dat['y'], label='Amazon Stock Price')
plt.xlabel('Time')
plt.ylabel('Price')
plt.title('Time Series Plot')
plt.legend()
plt.show()
dat['SMA'] = dat.iloc[:,1].rolling(window=100).mean()
dat['diff'] = dat['y'] - dat['SMA']
dat[['y','SMA']].plot()
print(f"There are {len(dat)} rows in this dataset.\n")
print("Since the data contains 6155 rows, I am setting the \"window\" parameter to 100,\nwhich means using the average of the past 100 data points to plot the SMA line.")
There are 6155 rows in this dataset. Since the data contains 6155 rows, I am setting the "window" parameter to 100, which means using the average of the past 100 data points to plot the SMA line.
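As a quick illustration of what the `window` parameter does (a toy example with made-up numbers, not the Amazon data): a rolling mean only produces a value once a full window of past points is available, so the first `window - 1` entries come out as NaN.

```python
import pandas as pd

# Toy series: with window=3, the first two entries are NaN
# because a full window of 3 points is not yet available.
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
sma = s.rolling(window=3).mean()
print(sma.tolist())  # [nan, nan, 2.0, 3.0, 4.0]
```

The same logic explains why the SMA line above only starts 100 points into the series.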
dat['diff'].hist()
plt.title('The distribution of diff')
print("After calculating the differences between the actual prices and the SMA,\nthe histogram shows that most points lie within about 200 above or below the SMA.")
After calculating the differences between the actual prices and the SMA, the histogram shows that most points lie within about 200 above or below the SMA.
dat['upper'] = dat['SMA'] + 200
dat['lower'] = dat['SMA'] - 200
dat[100:200]
| | ds | y | SMA | diff | upper | lower |
|---|---|---|---|---|---|---|
| 100 | 1997-10-07 | 4.057292 | 2.367604 | 1.689688 | 202.367604 | -197.632396 |
| 101 | 1997-10-08 | 4.005208 | 2.390365 | 1.614843 | 202.390365 | -197.609635 |
| 102 | 1997-10-09 | 3.750000 | 2.410781 | 1.339219 | 202.410781 | -197.589219 |
| 103 | 1997-10-10 | 3.901042 | 2.433438 | 1.467604 | 202.433438 | -197.566562 |
| 104 | 1997-10-13 | 4.000000 | 2.459167 | 1.540833 | 202.459167 | -197.540833 |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | 1998-02-24 | 5.406250 | 4.604948 | 0.801302 | 204.604948 | -195.395052 |
| 196 | 1998-02-25 | 5.489583 | 4.619635 | 0.869948 | 204.619635 | -195.380365 |
| 197 | 1998-02-26 | 6.062500 | 4.640156 | 1.422344 | 204.640156 | -195.359844 |
| 198 | 1998-02-27 | 6.416667 | 4.664167 | 1.752500 | 204.664167 | -195.335833 |
| 199 | 1998-03-02 | 6.354167 | 4.686458 | 1.667709 | 204.686458 | -195.313542 |
100 rows × 6 columns
def plot_it():
    plt.plot(dat['y'], 'go', markersize=2, label='Actual')
    plt.fill_between(
        np.arange(dat.shape[0]), dat['lower'], dat['upper'], alpha=0.5, color="r",
        label="Predicted interval")
    plt.xlabel("Ordered samples.")
    plt.ylabel("Values and prediction intervals.")
    plt.show()
plot_it()
print("Above is the tolerance band, which reveals the outliers.\n")
print("After drawing the tolerance band, we can clearly see the trend within our dataset.\nTo conclude, applying the SMA method to Amazon's stock price reveals a clear upward pattern.")
Above is the tolerance band, which reveals the outliers. After drawing the tolerance band, we can clearly see the trend within our dataset. To conclude, applying the SMA method to Amazon's stock price reveals a clear upward pattern.
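Besides plotting the band, the same `upper`/`lower` columns can flag the anomalous rows directly. A minimal sketch, assuming a DataFrame with the `y` and `SMA` columns built above (`band_anomalies` is a hypothetical helper, and the demo rows are synthetic, not Amazon prices):

```python
import pandas as pd

def band_anomalies(df, width=200.0):
    """Return rows where the price falls outside SMA +/- width.

    Assumes df has columns 'y' (price) and 'SMA' (rolling mean);
    width=200 mirrors the tolerance band used above.
    """
    upper = df['SMA'] + width
    lower = df['SMA'] - width
    return df[(df['y'] > upper) | (df['y'] < lower)]

# Synthetic check: the last point sits 300 above its SMA, so
# it breaches the +200 band and is flagged.
demo = pd.DataFrame({'y': [100.0, 110.0, 420.0], 'SMA': [105.0, 108.0, 120.0]})
print(band_anomalies(demo).index.tolist())  # [2]
```

Calling `band_anomalies(dat)` would list the dates the band plot shows as outliers.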
from statsmodels.tsa.api import SimpleExpSmoothing
import pandas as pd
import numpy as np
dat = pd.read_csv('Amazon.csv')
dat = dat[['Date','Close']]
dat.columns = ['ds','y']
dat.tail()
| | ds | y |
|---|---|---|
| 6150 | 2021-10-21 | 3435.010010 |
| 6151 | 2021-10-22 | 3335.550049 |
| 6152 | 2021-10-25 | 3320.370117 |
| 6153 | 2021-10-26 | 3376.070068 |
| 6154 | 2021-10-27 | 3396.189941 |
EMAfit = SimpleExpSmoothing(dat['y']).fit(smoothing_level=0.2, optimized=False)
EMA = EMAfit.forecast(3).rename(r'$\alpha=0.2$')
dat['EMA'] = EMAfit.predict(start=0)
dat['diff'] = dat['y'] - dat['EMA']
plt.figure(figsize=(10, 6))
plt.plot(dat['y'], label='Original Data', marker='o', linestyle='-', color='b', alpha=0.7)
plt.plot(dat['EMA'], label='EMA', marker='o', linestyle='-', color='r', alpha=0.7)
plt.xlabel('Time')
plt.ylabel('Value')
plt.legend()
plt.title('Original Data vs. EMA Smoothed Data')
plt.grid(True)
plt.tight_layout()
plt.show()
print("By setting the same smoothing level as the notes did,\nwhich is 0.2, the EMA gives more weight to recent observations in the time series.\nIn this case, the most recent data point in the series has a weight of 20% in the calculation of the EMA.\nThe second most recent data point has a weight of 20% * (1 - 0.2) = 16%.\nThe third most recent data point has a weight of 16% * (1 - 0.2) = 12.8%.\nAnd so on.\n")
print("A higher smoothing factor like 0.5 results in a more responsive EMA that reacts quickly to changes in the data,\nwhile a lower smoothing factor like 0.2 makes the EMA smoother and less responsive to short-term fluctuations.")
By setting the same smoothing level as the notes did, which is 0.2, the EMA gives more weight to recent observations in the time series. In this case, the most recent data point in the series has a weight of 20% in the calculation of the EMA. The second most recent data point has a weight of 20% * (1 - 0.2) = 16%. The third most recent data point has a weight of 16% * (1 - 0.2) = 12.8%. And so on. A higher smoothing factor like 0.5 results in a more responsive EMA that reacts quickly to changes in the data, while a lower smoothing factor like 0.2 makes the EMA smoother and less responsive to short-term fluctuations.
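The weight pattern described above follows directly from the exponential-smoothing recursion: the k-th most recent observation carries an implicit weight of alpha * (1 - alpha)^k. A quick check with alpha = 0.2:

```python
# Implicit weights of simple exponential smoothing with alpha = 0.2:
# the k-th most recent observation is weighted alpha * (1 - alpha)**k.
alpha = 0.2
weights = [alpha * (1 - alpha) ** k for k in range(3)]
print([round(w, 3) for w in weights])  # [0.2, 0.16, 0.128]
```

These match the 20%, 16%, and 12.8% figures above, and the geometric decay shows why older observations fade out rather than being dropped abruptly as in the SMA.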
dat['diff'].hist()
plt.title('The distribution of diff')
print("We can observe that the predictions and the histogram of the EMA differ from those of the SMA.\nThis time I used 100 to set the tolerance band.")
We can observe that the predictions and the histogram of the EMA differ from those of the SMA. This time I used 100 to set the tolerance band.
dat['upper'] = dat['EMA'] + 100
dat['lower'] = dat['EMA'] - 100
plot_it()
import pandas as pd
import statsmodels.api as sm
dat = pd.read_csv('Amazon.csv')
dat = dat[['Date', 'Close']]
dat.columns = ['ds', 'y']
dat = dat.reset_index(drop=True)
# Convert 'ds' column to datetime
dat['ds'] = pd.to_datetime(dat['ds'], format='%Y-%m-%d')
# Set the datetime index with frequency='D' (daily)
dat = dat.set_index('ds').asfreq('D')
# Fill missing values with forward fill
dat['y'] = dat['y'].ffill()
# Perform seasonal decomposition
result = sm.tsa.seasonal_decompose(dat['y'], model='additive')
# Plot the trend component for the first 200 data points
result.trend.iloc[1:200].plot(figsize=(12, 6), title='Trend Component (First 200 Data Points)')
plt.xlabel('Date')
plt.ylabel('Trend')
plt.grid(True)
plt.tight_layout()
plt.show()
print("The trend is easy to capture here: over time the increments in the stock price get larger,\nand the trend is clearly positive!")
The trend is easy to capture here: over time the increments in the stock price get larger, and the trend is clearly positive!
result.seasonal.iloc[1:100].plot(figsize=(12, 6), title='Seasonal Component (First 100 Data Points)')
plt.xlabel('Date')
plt.ylabel('Seasonal Effect')
plt.grid(True)
plt.tight_layout()
plt.show()
print("It seems there are no seasonal effects, or only very short stretches exhibiting seasonal patterns.")
It seems there are no seasonal effects, or only very short stretches exhibiting seasonal patterns.
result.resid.iloc[1:].plot(figsize=(12, 6), title='Residuals Component (All Data Points)')
plt.xlabel('Date')
plt.ylabel('Residuals')
plt.grid(True)
plt.tight_layout()
plt.show()
print("We can see that the residuals grow in magnitude as time goes by, which means the residuals are not stationary.\nIn other words, there are increasing anomalies when we use the STL decomposition to try to capture the real trend.")
We can see that the residuals grow in magnitude as time goes by, which means the residuals are not stationary. In other words, there are increasing anomalies when we use the STL decomposition to try to capture the real trend.
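One common way to turn the residual component into concrete anomaly flags is a z-score threshold. A minimal sketch, assuming a residual series like `result.resid` above (`residual_anomalies` is a hypothetical helper, and the demo series is synthetic):

```python
import pandas as pd

def residual_anomalies(resid, z=3.0):
    """Flag residuals more than z standard deviations from their mean.

    `resid` is the remainder component of a decomposition. A global
    std is used here for simplicity; a rolling std would adapt better
    when, as in the Amazon data, residual variance grows over time.
    """
    r = resid.dropna()
    score = (r - r.mean()) / r.std()
    return r[score.abs() > z]

# Synthetic residuals: one spike stands out from small noise.
demo = pd.Series([0.1, -0.2, 0.05, 0.0, 25.0, -0.1, 0.15])
print(residual_anomalies(demo, z=2.0).index.tolist())  # [4]
```

Because the real residuals are heteroscedastic, `residual_anomalies(result.resid)` would tend to flag mostly late-period points; standardizing within a rolling window would correct for that.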
from prophet import Prophet
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.offline as py
py.init_notebook_mode()
%matplotlib inline
dat = pd.read_csv('Amazon.csv')
dat = dat[['Date','Close']]
dat.columns = ['ds','y']
# Fitting with default parameters
dat_model_0 = Prophet(daily_seasonality=True)
dat_model_0.fit(dat)
print("I initialized a Facebook Prophet model with daily seasonality enabled, acknowledging the daily patterns in stock price fluctuations.\n")
print("This Prophet model can be used to learn and capture the underlying patterns and trends in Amazon's stock prices for future forecasting and analysis.")
00:52:20 - cmdstanpy - INFO - Chain [1] start processing 00:52:22 - cmdstanpy - INFO - Chain [1] done processing
I initialized a Facebook Prophet model with daily seasonality enabled, acknowledging the daily patterns in stock price fluctuations. This Prophet model can be used to learn and capture the underlying patterns and trends in Amazon's stock prices for future forecasting and analysis.
future= dat_model_0.make_future_dataframe(periods=20, freq='d')
future.tail()
print("Creating 20 future timestamp entries at daily intervals\nserves as the basis for conducting time series forecasting with the Facebook Prophet model.")
Creating 20 future timestamp entries at daily intervals serves as the basis for conducting time series forecasting with the Facebook Prophet model.
dat_model_0_data=dat_model_0.predict(future)
dat_model_0_data.tail()
| | ds | trend | yhat_lower | yhat_upper | trend_lower | trend_upper | additive_terms | additive_terms_lower | additive_terms_upper | daily | ... | weekly | weekly_lower | weekly_upper | yearly | yearly_lower | yearly_upper | multiplicative_terms | multiplicative_terms_lower | multiplicative_terms_upper | yhat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 6170 | 2021-11-12 | 3423.655898 | 3238.897681 | 3561.345435 | 3423.655898 | 3423.655898 | -20.297988 | -20.297988 | -20.297988 | -8.431807 | ... | -1.858536 | -1.858536 | -1.858536 | -10.007645 | -10.007645 | -10.007645 | 0.0 | 0.0 | 0.0 | 3403.357910 |
| 6171 | 2021-11-13 | 3425.180998 | 3247.565206 | 3572.819535 | 3425.180998 | 3425.180998 | -17.372189 | -17.372189 | -17.372189 | -8.431807 | ... | 1.053974 | 1.053974 | 1.053974 | -9.994356 | -9.994356 | -9.994356 | 0.0 | 0.0 | 0.0 | 3407.808809 |
| 6172 | 2021-11-14 | 3426.706097 | 3265.640672 | 3573.201261 | 3426.706097 | 3426.706097 | -17.304978 | -17.304978 | -17.304978 | -8.431807 | ... | 1.053974 | 1.053974 | 1.053974 | -9.927144 | -9.927144 | -9.927144 | 0.0 | 0.0 | 0.0 | 3409.401120 |
| 6173 | 2021-11-15 | 3428.231197 | 3234.355894 | 3579.547531 | 3428.231197 | 3428.231197 | -19.924146 | -19.924146 | -19.924146 | -8.431807 | ... | -1.679599 | -1.679599 | -1.679599 | -9.812740 | -9.812740 | -9.812740 | 0.0 | 0.0 | 0.0 | 3408.307051 |
| 6174 | 2021-11-16 | 3429.756297 | 3259.808963 | 3580.840221 | 3429.756297 | 3429.756297 | -17.863908 | -17.863908 | -17.863908 | -8.431807 | ... | 0.226597 | 0.226597 | 0.226597 | -9.658698 | -9.658698 | -9.658698 | 0.0 | 0.0 | 0.0 | 3411.892389 |
5 rows × 22 columns
from prophet.plot import add_changepoints_to_plot
# Create a Prophet model
dat_model_0 = Prophet()
# Fit the model to your data
dat_model_0.fit(dat)
# Create a future DataFrame for forecasting
future = dat_model_0.make_future_dataframe(periods=20, freq='D')
future.tail()
# Make predictions on the future DataFrame
dat_model_0_data = dat_model_0.predict(future)
dat_model_0_data.tail()
# Plot the forecast
fig = dat_model_0.plot(dat_model_0_data)
a = add_changepoints_to_plot(fig.gca(), dat_model_0, dat_model_0_data)
print("The main line in the graph represents the forecasted values of the time series data. This line provides predictions for future values based on historical patterns and the model's learned trends.\n")
print("The shaded areas represent uncertainty intervals, indicating the range within which the actual future values are likely to fall. The wider the uncertainty interval, the higher the uncertainty in the predictions.\n")
print("The points where vertical dashed lines intersect the time series line are potential changepoints. Changepoints are significant shifts in Amazon's stock price.")
00:52:25 - cmdstanpy - INFO - Chain [1] start processing 00:52:27 - cmdstanpy - INFO - Chain [1] done processing
The main line in the graph represents the forecasted values of the time series data. This line provides predictions for future values based on historical patterns and the model's learned trends. The shaded areas represent uncertainty intervals, indicating the range within which the actual future values are likely to fall. The wider the uncertainty interval, the higher the uncertainty in the predictions. The points where vertical dashed lines intersect the time series line are potential changepoints. Changepoints are significant shifts in Amazon's stock price.
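The interval-based anomaly check described above can be done with plain pandas once the forecast frame exists: an observation is a candidate anomaly when `y` falls outside `[yhat_lower, yhat_upper]`. A minimal sketch, assuming an actual frame with `['ds', 'y']` and a Prophet-style forecast with `['ds', 'yhat_lower', 'yhat_upper']` (`prophet_interval_anomalies` is a hypothetical helper, and the demo values are synthetic, merely shaped like the forecast table above):

```python
import pandas as pd

def prophet_interval_anomalies(actual, forecast):
    """Mark observations that fall outside the uncertainty interval.

    `actual` has columns ['ds', 'y']; `forecast` is a predict()-style
    frame with ['ds', 'yhat_lower', 'yhat_upper']. Rows where
    y < yhat_lower or y > yhat_upper are candidate anomalies.
    """
    merged = actual.merge(forecast[['ds', 'yhat_lower', 'yhat_upper']], on='ds')
    mask = (merged['y'] < merged['yhat_lower']) | (merged['y'] > merged['yhat_upper'])
    return merged[mask]

# Synthetic stand-in for dat / dat_model_0_data: the first price
# exceeds its upper bound, the second sits inside the interval.
actual = pd.DataFrame({'ds': ['2021-11-12', '2021-11-13'], 'y': [3600.0, 3300.0]})
fc = pd.DataFrame({'ds': ['2021-11-12', '2021-11-13'],
                   'yhat_lower': [3238.9, 3247.6],
                   'yhat_upper': [3561.3, 3572.8]})
print(prophet_interval_anomalies(actual, fc)['ds'].tolist())  # ['2021-11-12']
```

On the real data, `prophet_interval_anomalies(dat, dat_model_0_data)` would list the dates the forecast plot shows outside the shaded band.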
dat_model_0.plot_components(dat_model_0_data)
Simple Moving Average (SMA): SMA is effective at identifying anomalies in Amazon's all-time stock price data because it smooths out short-term fluctuations and highlights the underlying trend. Anomalies are often characterized by sudden, short-term deviations from the long-term upward trend. By comparing each data point to the moving average, SMA can readily flag periods when the stock price significantly deviates from the smoothed trend, making it a robust tool for identifying short-term anomalies.

Exponential Smoothing: Exponential smoothing generates a smoothed forecast by giving more weight to recent data points. Anomalies are detected by comparing actual stock prices to forecasted values. By calculating the residuals, which are the differences between actual and forecasted values, it identifies anomalies as large positive or negative residuals, indicating unexpected deviations from the expected stock price trajectory. This model's sensitivity to long-term trends and gradual shifts makes it effective at spotting anomalies associated with sustained changes.

Seasonal-Trend Decomposition (STL): STL decomposes the time series data into seasonal, trend, and residual components. It is effective at identifying anomalies because it explicitly separates seasonality and trend from the remainder (residuals). Unusual patterns or abrupt changes that don't conform to the expected seasonality or trend can be considered anomalies. STL helps differentiate between regular market behavior and irregular events impacting stock prices.

Prophet Module: Prophet is designed to handle time series data with complex components like trends, seasonality, and holidays. Since Amazon's stock price exhibits an upward trend over the long term, Prophet is particularly effective at identifying anomalies that deviate from it. It identifies anomalies by comparing observed stock prices to forecasted values and their associated prediction intervals. Anomalies are detected when observed values fall outside these intervals. This approach is robust for capturing complex patterns and anomalies within the context of the upward trend.